Progress Memo 2

Final Project
Data Science 1 with R (STAT 301-1)

Author

Chelsea Nelson

Published

December 8, 2023

Github Repo Link

Progress Summary

Data Wrangling

Before using my data, I first decided how to handle the missingness in my dataset. After further investigation, I realized that all of the missing values for my variables corresponded to one specific county and its multiple family-type cases. I therefore decided to remove that county's observations from my dataset entirely, as I felt that leaving them in would cause more problems for my analysis than taking them out. Below I show the variables that previously had missing values to confirm that the missingness in my data is gone. I can now move forward in my analysis without having to work around missing values in the variables that I want to use.

variable               n_miss   pct_miss
median_family_income   0        0
num_counties_in_st     0        0
st_cost_rank           0        0
st_med_aff_rank        0        0
st_income_rank         0        0
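
The removal and the check above can be sketched as follows. This is a minimal base R illustration on made-up data, not the code used for this memo: the data frame `costs`, its `county` column, and all values are assumptions. The column names of the summary mirror the table above (the same names that `naniar::miss_var_summary()` produces), but here they are built by hand.

```r
# Toy data: county labels and values are invented for illustration.
costs <- data.frame(
  county = c("A", "A", "B", "B", "C", "C"),
  median_family_income = c(52000, 52000, NA, NA, 61000, 61000),
  st_cost_rank = c(10, 10, NA, NA, 3, 3)
)

# Find the counties that have a missing value in any column.
has_na <- !complete.cases(costs)
na_counties <- unique(costs$county[has_na])

# Drop every row that belongs to those counties.
costs_clean <- costs[!costs$county %in% na_counties, ]

# Per-variable missingness summary: count and percent of NA values.
miss_summary <- data.frame(
  variable = names(costs_clean),
  n_miss = colSums(is.na(costs_clean)),
  pct_miss = 100 * colMeans(is.na(costs_clean)),
  row.names = NULL
)
print(miss_summary)
```

Dropping by county rather than by row keeps every remaining county complete across all of its family types.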

After handling the missingness in my dataset, I moved on to adding variables that I feel will help raise different questions and associations in my analysis. One provides the minimum wage in each state for 2022 (minimum_wage) [1], while the other gives the geographical region of each state (region) [2]. Additionally, I changed the metro variable to a factor, with 0 indicating that the county is located in a non-metropolitan area and 1 indicating that it is located in a metropolitan area. Below I have provided the updated version of my dataset, including both new variables. At this point, I still plan to add the racial majority makeup of each county; however, I am still looking for a way to bring in this information without too much hassle. For now, I have continued my analysis without it, but I have left space for it and still plan to include it in my final project.
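
A minimal sketch of this step, assuming the state-level information sits in a separate lookup table keyed by a `state` column. The table names, the `county` and `state` columns, and every value (including the wage figures) are placeholders for illustration; only `minimum_wage`, `region`, and `metro` are names from my dataset.

```r
# Hypothetical county-level rows (values invented).
counties <- data.frame(
  county = c("County A", "County B", "County C"),
  state = c("IL", "IL", "TX"),
  metro = c(1, 0, 1)
)

# Hypothetical state-level lookup; wage values are placeholders.
state_info <- data.frame(
  state = c("IL", "TX"),
  minimum_wage = c(12.00, 7.25),
  region = c("Midwest", "South")
)

# Left join: keep every county row and attach its state's values.
merged <- merge(counties, state_info, by = "state", all.x = TRUE)

# Store metro as a factor: 0 = non-metropolitan, 1 = metropolitan.
merged$metro <- factor(merged$metro, levels = c(0, 1))

str(merged)
```

Using `all.x = TRUE` means a county whose state is missing from the lookup would show up with `NA` rather than silently disappearing, which makes the join easy to check.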

Current EDA

Univariate Analysis

For my univariate analysis, I looked at both the categorical and numerical variables; the most interesting statistics and figures came from the numerical variables.

However, before looking into my numerical variables, I believe it is important to highlight the difference between the number of non-metro and metro counties in the dataset, to gauge whether this geographical difference will have any impact on how I view and analyze my findings in the future.

Figure 1: Looking at Metro Status of Counties

In Figure 1 above, we see that there are many more counties in non-metropolitan areas than in metropolitan areas. I am interested to see how this affects costs such as transportation and healthcare. Being farther from a metro area can mean more travel to obtain necessities, and people who live farther from hospitals, or who do not have as many hospitals nearby as those in highly urban areas, might go to the hospital less often. I am excited to look more into these relationships. Additionally, we could compare metropolitan areas in the South to those in the North, and likewise non-metropolitan areas in each region, to gauge whether geographical region matters more than metro status or vice versa.

Looking at my numerical variables, I want to focus my research mostly on the total annual, total monthly, healthcare annual, healthcare monthly, housing annual, and housing monthly costs. For me, these are the most useful variables for finding differences between counties at the micro and macro levels. Below I provide a brief explanation of the distribution of each variable at the national (univariate) level, and as I go further into my bivariate and multivariate analyses, I hope to expand on this at the regional and state levels.

Figure 2: Looking at Annual Costs

Looking at Figure 2, we see that the distribution of annual healthcare costs has an extremely large spread compared to the other annual variables. The plot shows a roughly symmetric shape that could be read as unimodal or even multimodal, with most annual healthcare costs falling around $12000. However, even away from this typical value, there are smaller but noticeable subgroups with healthcare costs around $6000 and $20000. There are many potential reasons for this large spread that I would love to look into, such as family size and location, as well as how the minimum wage and median family income relate to these higher healthcare costs. Turning to the distribution of annual housing costs, we see a right-skewed distribution, with most people spending around $12000 on housing annually. I am surprised that there isn’t a larger spread, as I know that housing in cities tends to be more expensive than housing in non-metropolitan areas, so I hope to see whether I can find this distinction in my research. Lastly, the distribution of annual total costs nationwide has a bimodal and slightly right-skewed shape, with most people spending around $60000 a year. Although this plot has a clear typical value, there is a lot of spread and variation around it that we must account for. I hope to do this by looking at how total annual expenses change state by state, while also seeing how the individual expenses within the total vary.

Figure 3: Looking at Monthly Costs

Turning our attention to the distributions of the same variables at the monthly level, we see trends similar to those I pointed out above. For example, the distribution of monthly healthcare costs shows a lot of variability, with a multimodal, slightly right-skewed shape and an average cost of around $1200 a month. For monthly housing expenses, the plot shows that people spend about $900 on average, with some special cases of people spending over $2000 a month, giving a unimodal, right-skewed distribution. Lastly, total monthly expenses also show a fairly large spread in how much different family types spend at the national level, with an average of around $7000 and a unimodal, right-skewed shape. From each of these distributions, I hope to go further and see why certain variables have so much more spread than others, as well as how these variables relate to each other and to variables like metro status and region.
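
One way to put a number on the spread comparisons above is the coefficient of variation (standard deviation divided by the mean), which lets variables on different dollar scales be compared. The sketch below is only an illustration: the column names and values are made up, and this statistic is not something I have computed for the memo yet.

```r
# Made-up monthly costs for a handful of families (not real data).
costs <- data.frame(
  healthcare_monthly = c(500, 900, 1200, 1600, 2100),
  housing_monthly = c(800, 850, 900, 950, 1100)
)

# Coefficient of variation: spread relative to the mean (unitless).
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Apply to every cost column and compare.
spread <- sapply(costs, cv)
print(round(spread, 3))
```

A larger value means more variation relative to the typical cost, so this could back up (or challenge) what the histograms suggest visually.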

Bivariate Analysis

Moving to bivariate analysis, I stayed at the national level and assessed whether the different types of expenses, both monthly and annual, differed by metro status, family type, and geographical region. Overall, I found that for all three characteristics the expenses showed little to no difference, apart from the shift by family type noted below the figures. The plots therefore suggest that, at the national level, expenses differ very little by metro status, family type, or geographical region, meaning that if differences are present they would be more visible at the regional, state, and county levels. Below, I have provided figures relating total annual and monthly expenses to family type, metro status, and geographical region as visual evidence for this claim.

Figure 4: Total Annual Expenses based on family type, metro status, and geographical region

Figure 5: Total Monthly Expenses based on family type, metro status, and geographical region

Looking at each set of plots, we see that the only real difference at this point is across family types: the distributions of total expenses (both annual and monthly) shift toward higher values as family size increases. This makes sense, as larger families tend to have more expenses. Otherwise, across geographical region and metro status, we see little to no difference in total expenses. Moving the analysis to the micro level should help uncover those currently hidden differences.
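
The visual comparison could be backed up with grouped summaries. Below is a minimal base R sketch on invented rows; the column names `family_type` and `total_annual` and the family-type labels are assumptions for illustration, not necessarily the names in my dataset.

```r
# Invented rows: two family types across metro and non-metro counties.
costs <- data.frame(
  family_type = c("1p0c", "1p0c", "2p2c", "2p2c"),
  metro = factor(c(0, 1, 0, 1)),
  total_annual = c(40000, 42000, 80000, 85000)
)

# Median total annual expenses within each family type.
by_family <- aggregate(total_annual ~ family_type, data = costs, FUN = median)

# Median total annual expenses within each metro status.
by_metro <- aggregate(total_annual ~ metro, data = costs, FUN = median)

print(by_family)
print(by_metro)
```

Comparing the gap between group medians for each grouping variable gives a quick numeric version of what the faceted plots show.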

Main findings so far

So far, I have only looked at these relationships at the national level. My next step is to move to the regional and state levels in order to find more compelling results.

One of my main findings so far is that, at the national level, there seem to be large disparities in how much people within each family type spend on healthcare, both annually and monthly. This is surprising because, compared to the other variables, healthcare showed considerably more variation in its distribution, which raises the question of what factors might be driving it. Currently, I wonder whether state policies play into this, as well as the types of insurance that people have. Additionally, I wonder whether this is a regional inconsistency or is due to differences in metro status. These are questions I hope to expand on as I further my analysis.

Another main finding from my early bivariate analysis is that the national level is too broad to reveal differences in expenses across the categorical variables I relate them to. To find statistically significant evidence, we have to look at the micro level first and then perhaps extend our findings to the macro level, rather than starting at the macro level and basing the micro-level analysis on those nonexistent findings.

Questions that I have created

Currently, I want to look at the main factors behind the large spread in our healthcare distributions at both the monthly and annual levels, while also seeing how the breakdown of total monthly and annual expenses varies by region or state.

Additionally, I would like to see whether families in states with higher minimum wages are better able to meet their budget requirements, and whether families in states with robust public transportation systems are able to allocate less of their budget to transportation costs, with more of their expenses going toward housing instead.

Furthermore, if I add this variable, I am interested in how the racial majority of a county corresponds to the main expenses of its residents as well as the amounts that they spend. Within this, I think it would be interesting to look at state policies and tax expenses and see whether we can draw conclusions based on these factors as well.

Next Steps

In terms of next steps, I hope to look statistically at correlations between the main variables I am interested in, to begin answering the questions above. Additionally, I hope to fit linear models to test these relationships. Furthermore, I want to look at state-specific relationships, working my way toward multivariate analysis to find statistically significant differences in expenses from county to county, state to state, and region to region.
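
As a rough sketch of what these next steps might look like in R, the example below computes a correlation and fits a simple linear model on invented state-level numbers. The `total_annual` column and all values are placeholders, and the model form is only one possibility, not a decision I have made.

```r
# Invented state-level values (not real data).
states <- data.frame(
  minimum_wage = c(7.25, 9, 10.5, 12, 14, 15),
  total_annual = c(58000, 60000, 63000, 64000, 69000, 72000)
)

# Pearson correlation between minimum wage and total annual expenses.
r <- cor(states$minimum_wage, states$total_annual)

# Simple linear regression of total annual expenses on minimum wage.
fit <- lm(total_annual ~ minimum_wage, data = states)

print(r)
print(summary(fit)$coefficients)
```

With a single predictor, the squared correlation equals the model's R-squared, so the two outputs can be checked against each other. Adding `region` or `metro` as further predictors would be the natural step toward the multivariate analysis.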

I still want to add racial majority census data to this dataset so that I can further analyze my variables on different levels, accounting for the multiple reasons why expenses in a county are the way they are.

Lastly, I plan to make a codebook for my dataset once I finish my data collection by adding the racial majority of each county. I will also work on creating the README files for my GitHub repo.

Footnotes

  1. This information was sourced from Paycom 2023 Guide to Every State’s Minimum Wage.

  2. This information was sourced from Census Regions and Divisions of the United States.